With Light 


A photonics compute platform to address AI demands 

Lightmatter's Mars device accelerates AI workloads 
Core computations performed optically using silicon photonics 
Multi-chip solution to get the best of transistor and photonics technology 
Photonics provides unprecedented opportunities for performance and efficiency 
Section 1 
Motivation 


Transistors aren't getting more efficient 
“Moore's law made CPUs 300x faster than in 1990...but it's over.” 

[Figure: microprocessor trend data by year, 1970 to 2020, with the physical limit at 300 K marked.] 


Source: https://github.com/karlrupp/microprocessor-trend-data 


=> Chips are getting hotter 


400 Watts per chip package 
is a practical limit 


Dark Silicon 
You can't use the whole chip at once 


Clock Frequency Saturation 


Exotic Cooling Solutions 


Electronics won't keep up 

AI growth rate 5x Moore's Law 

[Figure: "Two Distinct Eras of Compute Usage in Training AI Systems". Training compute by year from 1960: a First Era following a 2-year doubling (Moore's Law) and a Modern Era following a 3.4-month doubling. Labeled systems include Perceptron, LeNet-5, RNN for Speech, BiLSTM for Speech, Deep Belief Nets and layer-wise pretraining, DQN, and Neural Machine.] 

Source: OpenAI 
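To make the two trend lines concrete, here is a minimal sketch that turns the doubling periods quoted in the figure into growth factors. The function names are illustrative only. Note that the ratio of the two doubling periods works out to roughly 7; the slide's "5x" headline may rest on a different comparison that the source does not spell out.

```python
MOORE_DOUBLING_MONTHS = 24.0  # "2-year doubling (Moore's Law)" from the figure
AI_DOUBLING_MONTHS = 3.4      # "3.4-month doubling" from the figure


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months at a fixed doubling period."""
    return 2.0 ** (elapsed_months / doubling_months)


def doublings_ratio() -> float:
    """How many AI-compute doublings fit into one Moore's Law doubling."""
    return MOORE_DOUBLING_MONTHS / AI_DOUBLING_MONTHS


if __name__ == "__main__":
    for years in (1, 2, 5):
        months = 12 * years
        moore = growth_factor(months, MOORE_DOUBLING_MONTHS)
        ai = growth_factor(months, AI_DOUBLING_MONTHS)
        print(f"{years} yr: Moore x{moore:.1f}, AI training compute x{ai:.3g}")
```

Even over a single year the gap is more than an order of magnitude in the exponent, which is the motivation for looking beyond transistor scaling.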


Section 2 
Photonics Core 


Why photonics? 


Fundamental benefits independent of process node 


Less energy spent moving data 


Systolic array -> optical tensor core: higher clock frequency, less energy, same size, lower latency 

Wavelength- and polarization-division multiplexing. N processors, 1 physical resource. 

Photonic compute unit cell: 

Metric      Photonics   Electronics   Improvement 
Latency     100 ps      100 ns        10^3 
Bandwidth   20 GHz      2 GHz         10 
Power       1 uW        1 mW          10^3 
Area        2500 um²    2500 um²      1 
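As a quick check on the unit-cell comparison, the sketch below recomputes the improvement factors from the photonics and electronics figures quoted on the slide. The dictionary layout and the `improvement` helper are illustrative assumptions; only the numbers come from the slide.

```python
# Unit-cell figures from the slide, converted to SI units.
# Each entry: (photonics, electronics, higher_is_better)
UNIT_CELL = {
    "latency_s": (100e-12, 100e-9, False),
    "bandwidth_hz": (20e9, 2e9, True),
    "power_w": (1e-6, 1e-3, False),
    "area_m2": (2500e-12, 2500e-12, False),
}


def improvement(photonics: float, electronics: float, higher_is_better: bool) -> float:
    """Factor by which photonics beats electronics for one metric."""
    if higher_is_better:
        return photonics / electronics
    return electronics / photonics


if __name__ == "__main__":
    for name, row in UNIT_CELL.items():
        print(f"{name}: {improvement(*row):.0f}x")
```

Latency and power each improve by about three orders of magnitude while the cell area stays the same.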


Less energy spent moving data 


[Figure: data flow across a chip, with dimensions L = 3 cm and 2 cm.] 

10s of Watts just for passive data transport with electronics. 
Free with optics... and no length-dependent RC time constant! 


Compute tile 


[Figure: electric field in a waveguide; MZI output versus differential phase.] 

MZIs provide observation of phase shift through interference 
These become useful when you modulate the phase on φ1 and φ2 

Need an electrically-controlled phase shift for MZI. Some examples... 
Thermal (slow, consumes static power) 
P/N junction (fast but large and lossy) 
Mechanical! 


Mars uses Nano Optical Electro Mechanical System (NOEMS) 

Waveguide bends with electrostatic charge 
Capacitance of the NOEMS stores the phase shift setting 
Low loss 
Actuation speed is 100s of MHz 

Optical Vector MAC 


Compute tile 


[Equation: 2x2 MZI transfer matrix with sin and cos terms of the phase.] 
Directional couplers combine input signals in pairs 
Phase shifters provide programmability 

OV MAC computes a 2x2 matrix multiplied by a 2-element vector 

We just did a 2x2 matrix-vector product (MVP) at the speed of light with near zero power! 

No RC delays, no dynamic power. 
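The 2x2 operation can be illustrated with a textbook model of a Mach-Zehnder interferometer: two ideal 50:50 directional couplers with a programmable phase shifter between them. This is a minimal sketch under stated assumptions (lossless couplers, a phase shifter on the upper arm only, hypothetical function names), not a description of the Mars device itself. In this model the optical power splits as sin²(θ/2) and cos²(θ/2) between the two outputs.

```python
import cmath
import math


def matmul2(a, b):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def coupler():
    """Ideal lossless 50:50 directional coupler (assumed model)."""
    s = 1.0 / math.sqrt(2.0)
    return [[s, 1j * s], [1j * s, s]]


def phase_shifter(angle):
    """Phase shift of `angle` radians on the upper waveguide only."""
    return [[cmath.exp(1j * angle), 0.0], [0.0, 1.0]]


def mzi(theta, phi=0.0):
    """MZI: input phase phi, coupler, internal phase theta, coupler."""
    m = matmul2(coupler(), phase_shifter(phi))  # phi is applied first
    m = matmul2(phase_shifter(theta), m)
    return matmul2(coupler(), m)


def apply(m, v):
    """2x2 matrix times a 2-element vector of optical field amplitudes."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def powers(v):
    """Optical power at each output, as a photodetector would see it."""
    return [abs(x) ** 2 for x in v]


if __name__ == "__main__":
    for theta in (0.0, math.pi / 2, math.pi):
        print(theta, powers(apply(mzi(theta), [1.0, 0.0])))
```

Changing `theta` reprograms the matrix; the multiplication itself happens as the light propagates and interferes, which is why the slide attributes no dynamic power to the compute step.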


Arrays of MZIs 


Vector MZI 


Arrays of 2x2 MZIs provide 
matrix * vector products 
Data Conversion at Edges of Square 

[Figure: receive chain from current input through a transimpedance amplifier (feedback resistor R) and a voltage amplifier to voltage output.] 


Detectors and ADC convert output 
vector to digital 


Analog input, digital output 

Mars: 64 DACs and 64 ADCs for 4096 MAC operations 

Performance scales with area, like electronics 

Power scales with sqrt(area): this is interesting! 
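The sqrt(area) scaling follows from where the converters sit: an N x N array performs N² MACs but only needs converters along its edges. A minimal sketch of that count, assuming one DAC per input row and one ADC per output column as in the Mars figures above (the function name is hypothetical):

```python
import math


def converter_budget(n: int) -> dict:
    """Converter and MAC counts for an n x n photonic array."""
    macs = n * n         # compute grows with area
    converters = 2 * n   # n DACs + n ADCs at the edges
    return {
        "macs": macs,
        "dacs": n,
        "adcs": n,
        "converters": converters,
        "macs_per_converter": macs / converters,
    }


if __name__ == "__main__":
    for n in (16, 64, 256):
        b = converter_budget(n)
        # converters == 2 * sqrt(macs): conversion cost tracks sqrt(area)
        assert b["converters"] == 2 * math.isqrt(b["macs"])
        print(n, b)
```

For N = 64 this gives 4096 MACs served by 128 converters, or 32 MACs per conversion, and the ratio keeps improving as the array grows.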


Wavelength- and polarization-division multiplexing. N processors, 1 physical resource. 
1 Processor => ‘N’ Parallel Instances 


Mars Photonics Core 


64x64 matrix * 64-element vector 
8K ops per “cycle” 
1 GHz vector rate 
50 mW laser 
8-bit signed operands 
200 ps latency 
90 nm standard photonics process 
150 mm² 

[Figure: vector modulators feeding the 64x64 photonic matrix, with weight inputs.] 
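The headline figures combine into a peak throughput number. The sketch below assumes each multiply-accumulate counts as two operations, which is what makes 64 x 64 = 4096 MACs line up with "8K ops"; the constant and function names are illustrative.

```python
N = 64                 # 64x64 matrix, 64-element vector (from the slide)
VECTOR_RATE_HZ = 1e9   # 1 GHz vector rate (from the slide)
OPS_PER_MAC = 2        # assumption: one multiply plus one add per MAC


def ops_per_cycle(n: int = N) -> int:
    """Operations completed per vector pass through the array."""
    return OPS_PER_MAC * n * n


def peak_ops_per_second(n: int = N, rate_hz: float = VECTOR_RATE_HZ) -> float:
    """Peak throughput if a new vector enters the array every cycle."""
    return ops_per_cycle(n) * rate_hz


if __name__ == "__main__":
    print(ops_per_cycle(), "ops per cycle")
    print(peak_ops_per_second() / 1e12, "tera-ops per second (peak)")
```

This is a peak figure for the photonic array alone; sustained throughput depends on how well the digital system keeps it fed.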
Section 3 
Digital System 


Mars SoC 


14 nm custom ASIC 
50 mm² 
Analog interfaces to Photonics Core 
SRAM for weights and activations 
I/O interfaces 

Digital offloads: 
Non-linearities (GELU, sigmoid, etc.) 
Scaling and accumulation 

[Figure: Mars SoC floorplan with Activation Storage and Weight Storage.] 
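The division of labor between the photonic core and the digital offloads can be sketched for a single layer. In this hedged example an integer matrix-vector product stands in for the photonic MAC array with 8-bit signed operands, and the scaling and GELU steps stand in for the digital offloads. The per-tensor scale scheme and all function names are assumptions for illustration, not the Mars implementation.

```python
import math


def quantize(values, scale):
    """Map floats to signed 8-bit integers, clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def matvec_int(weights_q, x_q):
    """Integer matrix-vector product; stands in for the photonic MAC array."""
    return [sum(w * x for w, x in zip(row, x_q)) for row in weights_q]


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer(weights, x, w_scale, x_scale):
    """Quantize, multiply, then rescale and apply the non-linearity digitally."""
    weights_q = [quantize(row, w_scale) for row in weights]
    x_q = quantize(x, x_scale)
    acc = matvec_int(weights_q, x_q)              # accumulation
    return [gelu(a * w_scale * x_scale) for a in acc]  # scaling + GELU


if __name__ == "__main__":
    w = [[0.5, -0.25], [1.0, 0.75]]
    print(layer(w, [1.0, 2.0], w_scale=0.25, x_scale=0.5))
```

Only the `matvec_int` step would run optically; everything else in `layer` is work the slide assigns to the ASIC.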
Digital Architecture 

Buffer organization minimizes data movement 
Heterogeneous on-chip networks 
Single fully synchronous pipeline scheduler 

[Figure: VSlice block diagram with vertical networks and control, I/O interfaces, activation buffers, weight caches, and weight buffers.] 
Activation Pipeline 

Digital pipeline maintains operand locality 
SRAM access minimized 
True pipelining for small batch performance 
Photonics outputs physically close to inputs 

[Figure: activation pipeline with blocks labeled Activation Storage, Non Linearity, Accumulation, Scaling, Linearization, and Photonics MAC Array.] 
Weight Updates 

[Figure: ASIC and Photonics Core, with vector, MZI, and phase shifter blocks.] 

Weights stored near MZI as charge 

Static power near zero 

Computational dynamic power near zero 
Dynamic power for weight update comparable to 
electronics 


Batching saves SRAM/transport power 


Only pay for data conversion power, not compute 
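The benefit of batching can be seen by counting data conversions per input vector when one weight tile is reused across a batch. This is a rough sketch that assumes one DAC write per weight when a 64x64 tile is loaded, plus 64 DAC and 64 ADC conversions per vector; the function name and the counting model are assumptions, and no energy figures are implied.

```python
N = 64  # array dimension from the Mars figures


def conversions_per_vector(batch_size: int, n: int = N) -> float:
    """Average data conversions per input vector for one weight tile."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    weight_writes = n * n       # assumed: one DAC write per weight
    vector_conversions = 2 * n  # n input DACs + n output ADCs per vector
    return weight_writes / batch_size + vector_conversions


if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        print(batch, conversions_per_vector(batch))
```

With a batch of one the weight load dominates; as the batch grows the cost per vector approaches the 128 edge conversions that cannot be amortized.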
Power and Performance 

Mars Power 

Component         Share 
Laser             8.8% 
Misc Photonics    7.5% 
ADCs              12.7% 
Static            0.7% 
Weight DACs       10.3% 
Vector DACs       11.3% 
Digital Dynamic   48.7% 


Under 3W TDP 


Most power is data movement 


ResNet-50 @ 99% accuracy 
(ImageNet, compared to FP32) 
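A short sketch makes the power breakdown easier to read by grouping the published shares and converting them into upper bounds in watts. It treats the "under 3W" TDP as a 3.0 W ceiling; the grouping into data conversion and photonics is an assumption made here for illustration.

```python
# Shares of total power from the Mars power breakdown, in percent.
POWER_SHARE_PCT = {
    "Laser": 8.8,
    "Misc Photonics": 7.5,
    "ADCs": 12.7,
    "Static": 0.7,
    "Weight DACs": 10.3,
    "Vector DACs": 11.3,
    "Digital Dynamic": 48.7,
}
TDP_UPPER_BOUND_W = 3.0  # "Under 3W TDP", used as a ceiling

# Assumed groupings, not taken from the slide.
DATA_CONVERSION = ("ADCs", "Weight DACs", "Vector DACs")
PHOTONICS = ("Laser", "Misc Photonics")


def group_share(names) -> float:
    """Combined share, in percent, of the named components."""
    return sum(POWER_SHARE_PCT[name] for name in names)


def upper_bound_watts(share_pct: float, tdp_w: float = TDP_UPPER_BOUND_W) -> float:
    """Upper bound on power for a given share of the TDP ceiling."""
    return tdp_w * share_pct / 100.0


if __name__ == "__main__":
    print("total %:", round(sum(POWER_SHARE_PCT.values()), 1))
    print("data conversion %:", round(group_share(DATA_CONVERSION), 1))
    print("photonics %:", round(group_share(PHOTONICS), 1))
    print("digital dynamic W <", round(upper_bound_watts(48.7), 3))
```

Digital dynamic power and data conversion together account for over 80% of the budget, consistent with the observation that most power goes to moving data rather than computing on it.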
3D Integration 

Low power compute allows stacking 
Stacking reduces data movement 
SRAM is closer to compute 

[Figure: chip stack including the 90 nm optical core.] 


Putting it all together: Software 

Neural network model optimization and deployment 
Interface with standard deep learning frameworks, compilers, and model exchange formats. 

ML frameworks: PyTorch, TensorFlow, ONNX 

[Figure: software stack on the host, from ML framework through the compiler to the Mars accelerator.] 

Provides services for... 

Service             Description 
Model               Simulate the effects of model parameters on accuracy and performance 
Compile & Execute   Map model to hardware; optimize model performance; execute generated code 
Debug               Access intermediate state in the model under execution 
Profile             Find and fix performance and resource bottlenecks 
General-purpose AI inference acceleration. Photonics provides core compute. 

Image Recognition, Sentiment Analysis, Machine Translation, Recommendation 
Summary 

Optical computing is here, and focused on AI 
Lightmatter's Mars chip leverages photonics for compute, electronics for activation and I/O 
3D stacking brings weights and activations closer to the compute core 
Freedom in power budget allows larger devices and more SRAM 